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Abstract 



In this paper, we consider the sparse eigenvalue problem wherein the goal is to obtain a 
sparse solution to the generalized eigenvalue problem. We achieve this by constraining the 
cardinality of the solution to the generalized eigenvalue problem and obtain sparse prin- 
cipal component analysis (PC A), sparse canonical correlation analysis (CCA) and sparse 
Fisher discriminant analysis (FDA) as special cases. Unlike the £i-norm approximation 
to the cardinality constraint, which previous methods have used in the context of sparse 
PCA, we propose a tighter approximation that is related to the negative log-likelihood 
of a Student's t-distribution. The problem is then framed as a d.c. (difference of con- 
vex functions) program and is solved as a sequence of convex programs by invoking the 
majorization- minimization method. The resulting algorithm is proved to exhibit global 
convergence behavior, i.e., for any random initialization, the sequence (subsequence) of 
iterates generated by the algorithm converges to a stationary point of the d.c. program. 
The performance of the algorithm is empirically demonstrated on both sparse PCA (finding 
few relevant genes that explain as much variance as possible in a high-dimensional gene 
dataset) and sparse CCA (cross-language document retrieval and vocabulary selection for 
music retrieval) applications. 

Keywords: Generalized eigenvalue problem, Principal component analysis, Canonical 
correlation analysis, Fisher discriminant analysis, D.c. program, Majorization-minimization, 
Global convergence analysis, Music annotation, Cross-language document retrieval. 

1. Introduction 

The generalized eigenvalue (GEV) problem for the matrix pair (A, B) is the problem of 
finding a pair (A, x) such that 



Ax = XBx 
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where A, Be C nxn , C n B x / and A € C. When B is an identity matrix, the problem in 
(1) is simply referred to as an eigenvalue problem. Eigenvalue problems are so fundamental 
that they have applications in almost every area of science and engineering (Strang, 1986). 

In multivariate statistics, GEV problems are prominent and appear in problems dealing 
with high-dimensional data analysis, visualization and pattern recognition. In these appli- 
cations, usually x £ R ra , A 6 S n (the set of symmetric matrices of size nxn defined over R) 
and B £ S™ + (set of positive definite matrices of size nxn defined over R). The variational 
formulation for the GEV problem in (1) is given by 

X max (A,B) = m&x{x T Ax : x T Bx = 1}, (GEV-P) 

where X max (A, B) is the maximum generalized eigenvalue associated with the matrix pair, 
(A,B). The x that maximizes (GEV-P) is called the generalized eigenvector associated 
with X max (A, B). Some of the well-known and widely used data analysis techniques that 
are specific instances of (GEV-P) are: 

(a) Principal component analysis (PCA) (Hotelling, 1933; Jolliffe, 1986), a classic tool 
for data analysis, data compression and visualization, finds the direction of maximal 
variance in a given multivariate data set. This technique is used in dimensionality 
reduction wherein the ambient space in which the data resides is approximated by a 
low-dimensional subspace without significant loss of information. The variational form 
of PCA is obtained by choosing A to be the covariance matrix (which is a positive 
semidefinite matrix defined over R) associated with the multivariate data and B to 
be the identity matrix in (GEV-P). 

(b) Canonical correlation analysis (CCA) (Hotelling, 1936), similar to PCA, is also a data 
analysis and dimensionality reduction method. However, while PCA deals with only 
one data space X (from which the multivariate data is obtained), CCA proposes a way 
for dimensionality reduction by taking into account relations between samples from 
two spaces X and y. The assumption is that the data points from these two spaces 
contain some joint information that is reflected in correlations between them. Direc- 
tions along which this correlation is high are thus assumed to be relevant directions 
when these relations are to be captured. The variational formulation for CCA is given 

max vZV xyW y ^ (2) 

where w x and w y are the directions in X and y along which the data is maximally 
correlated. ^ xx and ~S yy represent the covariance matrices for X and y respectively 
and "E xy = S^, represents the cross-covariance matrix between X and y. (2) can be 



rewritten as 



max{w x 'E xy Wy : w x 'E xx w x = 1, WyYiy y Wy = 1}, (3) 



which in turn can be written in the form of (GEV-P) with A 
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(c) In the binary classification setting, Fisher discriminant analysis (FDA) finds a one- 
dimensional subspace, w 6 M. n , the projection of data onto which leads to maximal 
separation between the classes. Let \Xi and XI j denote the mean vector and covariance 
matrix associated with class i. The variational formulation of FDA is given by 

O r (/,ti -/j 2 )) 2 

which can be rewritten as 

max w T (ni - /i2)(A A i - H2) T w 

w 

s.t. w T (Y,i + £ 2 )w = 1. (4) 

Therefore, the FDA formulation is similar to (GEV-P) with A = (hi — /x 2 )(/xi — 
\ called the between- cluster variance and B = Si + S 2 , called the within- cluster 
variance. For multi-class problems, similar formulations lead to multiple-discriminant 
analysis. 

Despite the simplicity and popularity of these data analysis and modeling methods, one 
key drawback is the lack of sparsity in their solution. They suffer from the disadvantage 
that their solution vector, i.e., x is a linear combination of all input variables, which often 
makes it difficult to interpret the results. In the following, we point to different applications 
where PCA/CCA/FDA is used and motivate the need for sparse solutions. 

In many PCA applications, the coordinate axes have a physical interpretation; in biology, 
for example, each axis might correspond to a specific gene. In these cases, the interpretation 
of the principal components would be facilitated if they contained only few non-zero entries 
(or, loadings) while explaining most of the variance in the data. Moreover, in certain appli- 
cations, e.g., financial asset trading strategies based on PCA techniques, the sparsity of the 
solution has important consequences, since fewer non-zero loadings imply fewer transaction 
costs. For CCA, consider a document translation application where two copies of a corpus 
of documents, one written in English and the other in German are given. The goal is to 
extract multiple low-dimensional representations of the documents, one in each language, 
each explaining most of the variation in the documents of a single language while maximiz- 
ing the correlation between the representations to aid translation. Sparse representations, 
equivalent to representing the documents with a small set of words in each language, would 
allow to interpret the underlying translation mechanism and model it better. In music 
annotation, CCA can be applied to model the correlation between semantic descriptions of 
songs (e.g., reviews) and their acoustic content. Sparsity in the semantic canonical compo- 
nents would allow to select the most meaningful words to describe musical content. This is 
expected to improve music annotation and retrieval systems. In a classification setting like 
FDA, feature selection aids generalization performance by promoting sparse solutions. To 
summarize, sparse representations are generally desirable as they aid human understanding, 
reduce computational and economic costs and promote better generalization. 

In this paper, we consider the problem of finding sparse solutions while explaining the 
statistical information in the data, which can be written as 

max{x T Ax : x T Bx = 1, ||cc||o < k}, (SGEV-P) 
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where 1 < k < n and ||x||o denotes the cardinality of x, i.e., the number of non-zero elements 
of x. The above program can be solved either as a continuous optimization problem after 
relaxing the cardinality constraint or as a discrete optimization problem. In this paper, we 
follow the former approach. The first step in solving (SGEV-P) as a continuous optimization 
problem is to approximate the cardinality constraint. One usual heuristic is to approximate 
||x||o by || as ||i (see Section 2 for the details on notation). Building on the earlier version 
of our work (Sriperumbudur et al., 2007), in this paper, we approximate the cardinality 
constraint in (SGEV-P) as the negative log-likelihood of a Student's t-distribution, which 
has been used earlier in many different contexts (Weston et al., 2003; Fazel et al., 2003; 
Candes et al., 2007). We then formulate this approximate problem as a d.c. (difference of 
convex functions) program and solve it using the majorization-minimization (MM) method 
(Hunter and Lange, 2004) resulting in a sequence of quadratically constrained quadratic 
programs (QCQPs). As a special case, when A is positive definite and B is an identity 
matrix (as is the case for PCA), a very simple iterative update rule (we call it as DC-PC A) 
can be obtained in a closed form, which has a per iteration complexity of 0(n 2 ). Since 
the proposed algorithm is an iterative procedure, using results from the global convergence 
theory of iterative algorithms (Zangwill, 1969), we show that it is globally convergent, i.e., for 
any random initialization, the sequence (subsequence) of iterates generated by the algorithm 
converges to a stationary point of the d.c. program (see Section 3.4 for a detailed definition). 
We would like to mention that the algorithm presented in this paper is more general than 
the one in Sriperumbudur et al. (2007) as it holds for any A G S n unlike in Sriperumbudur 
et al. (2007), where A is assumed to be positive semidefinite. 

We illustrate the performance of the proposed algorithm on sparse PCA and sparse 
CCA problems. On the sparse PCA front, we compare our results to SPCA (Zou et al., 
2006), DSPCA (d'Aspremont et al., 2007), GSPCA (Moghaddam et al., 2007) and GPower €o 
(Journee et al., 2008) in terms of sparsity vs. explained variance on the "pit props" bench- 
mark dataset and a random test dataset. Since DSPCA and GSPCA are not scalable for 
large-scale problems, we compare the performance of DC-PCA to SPCA and GPower^ on 
three high-dimensional gene datasets where the goal is to find relevant genes (as few as 
possible) while explaining the maximum possible variance. The results show that DC-PCA 
performs similar to most of these algorithms and better than SPCA, but at better computa- 
tional speeds. The proposed sparse CCA algorithm is used in two sparse CCA applications, 
one dealing with cross-language document retrieval and the other with vocabulary selection 
in music annotation. The cross-language document retrieval application involves a collec- 
tion of documents with each document in different languages, say English and French. The 
goal is, given a query string in one language, retrieve the most relevant document(s) in 
the target language. We experimentally show that the proposed sparse CCA algorithm 
performs similar to the non-sparse version, however using only 10% of non-zero loadings 
in the canonical components. In the vocabulary selection application, we show that sparse 
CCA improves the performance of a statistical musical query system by selecting only those 
words (i.e., pruning the vocabulary) that are correlated to the underlying audio features. 

The paper is organized as follows. We establish the mathematical notation in Section 2. 
In Section 3, we present the sparse generalized eigenvalue problem and discuss a tractable 
convex semidefinite programming (SDP) approximation. Since the SDP approximation is 
computationally intensive for large n, in Section 3.1, we present our proposed approximation 
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to the sparse GEV problem resulting in a d.c. program. This is then solved as a sequence of 
QCQPs in Section 3.3 using the majorization-minimization method that is briefly discussed 
in Section 3.2. The convergence analysis of the sparse GEV algorithm is presented in 
Section 3.4. Finally, in Sections 4 and 5, we derive sparse PCA and sparse CCA as special 
instances of the proposed algorithm and present experimental results to demonstrate the 
performance, while in Section 6, we discuss the applicability of the proposed algorithm to 
the sparse FDA problem. 

2. Notation 

S n (respectively denotes the set of symmetric (respectively positive semidefinite, 

positive definite) n x n matrices defined over R. For X G § n , X y (respectively X y 0) 
means that X is positive definite (respectively semidefinite). We denote a vector of ones 
and zeros by 1 and respectively. Depending on the context, will also be treated as a 
zero matrix. \X\ is the matrix whose elements are the absolute values of the elements of 
X. [X]ij denotes the (i,j) th element of X. For x = (x\,X2, ■ ■ ■ ,x n ) T G M. n , x y denotes 
an element- wise inequality. ||£c||o denotes the number of non-zero elements of the vector 
x, \\x\\ p := {Ya=i \ x i\ p ) 1 ^ p i 1 < p < oo and ||x|| oo • — maxi<j< n I n denotes an n x n 
identity matrix. D(x) represents a diagonal matrix formed with x as its principal diagonal. 

3. Sparse Generalized Eigenvalue Problem 

As mentioned in Section 1, the sparse generalized eigenvalue problem in (SGEV-P) can be 
solved either as a continuous optimization problem after relaxing the cardinality constraint 
or as a discrete optimization problem. In this section, we consider the former approach. 

Let us consider the variational formulation for the sparse generalized eigenvalue problem 
in (SGEV-P), where A £ §™ and B G + • Suppose A is not negative definite. Then 
(SGEV-P) is the maximization of a non-concave objective over the non-convex constraint 
set <3? := {x : x T Bx = 1} n {x : \\x\\q < k}. Although <3? can be relaxed to a convex 
set $ := {x : x T Bx < 1} n {x : \\x\\i < k}, it does not simplify the problem as the 
maximization of a non-concave objective over a convex set is still computationally hard and 
intractable [p. 342] (Rockafellar, 1970). 1 So, the intractability of (SGEV-P) is due to two 
reasons: (a) maximization of the non-concave objective function and (b) the constraint set 
being non-convex. Since (SGEV-P) is intractable, instead of solving it directly, one can 
solve approximations to (SGEV-P) that are tractable. Different tractable approximations 
to (SGEV-P) are possible, of which we briefly discuss the convex semidefinite programming 
(SDP) approximation and then motivate our proposed non-convex approximation. 

First, let us consider the following approximate program that is obtained by relaxing 
the non-convex constraint set $ to the convex set $, as described before: 

max{x T Ax : x G (5) 

1. Note that (GEV-P) also involves the maximization of a non-concave objective over a non-convex set. 
However, it is well-known that polynomial-time algorithms exist to solve (GEV-P), which is due to its 
special structure of a quadratic objective with a homogeneous quadratic constraint (Boyd and Vanden- 
berghe, 2004, p. 229). 
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As mentioned before, this program is still intractable due to the maximization of the non- 
concave objective. Had the objective function been linear, (5) would have been a canonical 
convex program, which could then be solved efficiently. One approach to linearize the 
objective function is by using the lifting technique (Lemarechal and Oustry, 1999, Section 
4.4), which was considered by d'Aspremont et al. (2005) when A y and B = I n . The 
lifted version of (5) is given by (see Appendix A for details): 

max ti(XA) 

X,x 

s.t. tr(XB) < 1, ||cc||i < k 

X = xx T . (6) 

Note that in the above program, the objective function is linear in X, and the constraints are 
convex except for the non-convex constraint, X = xx T (X = xx T X y 0, rank(JT) = 
1, where rank(JT) = 1 is a non-convex constraint and therefore X = xx T is a non-convex 
constraint). Relaxing X = xx to X — xx y results in the following program 

max tr(XA) 

X,x 

s.t. tr(XB) < 1, ||cc||i < k 

X - xx T y 0, (7) 

which is a semidefinite program (SDP) (Vandenberghe and Boyd, 1996). The ^i-norm 
constraint in (7) can be relaxed as \\x\\1 <k 2 ^> 1 | -X" 1 1 < k 2 so that the problem reduces 
to solving only for X. Therefore, we have obtained a tractable convex approximation to 
(SGEV-P). 

Although (7) is a convex approximation to (SGEV-P), it is computationally very in- 
tensive as general purpose interior-point methods for SDP scale as 0(n 6 log e _1 ), where e 
is the required accuracy on the optimal value. For large-scale problems, first-order meth- 
ods (Nesterov, 2005; d'Aspremont et al., 2007) can be used which scale as 0(e _1 n 4 v / Iogn). 
Therefore, the SDP-based convex relaxation to (SGEV-P) is prohibitively expensive in com- 
putation for large n. 

In the following section, we propose a different approximation to (SGEV-P), wherein 
instead of the ^-approximation to the cardinality constraint, we consider a non-convex 
approximation to it. We present a d.c. (difference of convex functions) formulation for 
this approximation to (SGEV-P), which is then solved as a sequence of QCQPs using the 
maj orization-minimization algorithm. 

3.1 Non-convex approximation to ||aj||o and d.c. formulation 

The proposed approximation to (SGEV-P) is motivated by the following observations. 

• Because of the non-concave maximization, a convex relaxation of the cardinality con- 
straint does not simplify (SGEV-P). So, a better approximation to the cardinality 
constraint than the tightest convex relaxation, i.e., ||a?||i, can be explored to improve 
sparsity. 

• Approximations that yield good scalability should be explored (as opposed to, e.g., 
the SDP approximation which scales badly in n). 
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To this end, we consider the regularized (penalized) version of (SGEV-P) given by 

max{x T Ax - p \\x\\ : x T Bx < 1}, (SGEV-R) 

where p > is the regularization (penalization) parameter. Note that the quadratic equality 
constraint, x Bx = 1 is relaxed to the inequality constraint, x T Bx < 1. Since 

ii ii V^n r + \ x i\/ £ ) f Q \ 

ll-llo = £ H^} = fag i og (i + i/ e) ' ^ 



(SGEV-R) is equivalent 2 to 



i=i 

2 



max 



a; Aa? - p hm > — — , , 

6-0^ Wl + l/e) 



- log(l + l/e) 

s.t. x^Ba; < 1. (9) 

The above program is approximated by the following approximate sparse GEV program by 
neglecting the limit in (9) and choosing e > 0, 

max ^-^M+ffi 

log(l + l/e) 

s.t. a; T Bic < 1, (10) 

which is equivalent to 

max |« T A« - p £ glogflzil + e) : ^ T .B« < l| , (SGEV-A) 

where p £ := p/log(l + e _1 ). Note that the approximate program in (SGEV-A) is a contin- 
uous optimization problem unlike the one in (SGEV-R), which has a combinatorial term. 
Before we present a d.c. program formulation to (SGEV-A), we briefly discuss the approx- 
imation to || a: ||o that we considered in this paper. 

Approximation to ||x||o: The approximation (to ||aj||o) that we considered in this paper, 
i.e., 

A log(l + \xi\e~ 1 ) 



x 



i.=-E 



l0g(l + E- 



has been used in many different contexts: feature selection using support vector machines 
(Weston et al., 2003), sparse signal recovery (Candes et al., 2007), matrix rank minimization 
(Fazel et al., 2003), etc. This approximation is interesting because of its connection to sparse 
factorial priors that are studied in Bayesian inference, and can be interpreted as defining 
a Student's t-distribution prior over x, an improper prior given by Y17=i \ x \+e ' Tipping 
(2001) showed that this choice of prior leads to a sparse representation and demonstrated 
its validity for sparse kernel expansions in the Bayesian framework. Other approximations 



2. Two programs are equivalent if their optimizers are the same. 
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to || a; || o are possible, e.g., Bradley and Mangasarian (1998) used £™ =1 (1 - e~ a ^) with 
a > (||a:||o = linic^oo £™ =1 (1 — e - "^ 4 ')) as an approximation to ||a:||o in the context of 
feature selection using support vector machines. 

We now show that the approximation (to ||#||o) considered in this paper, i.e., ||a;|| £ , is 
tighter than the £i-norm approximation, for any e > 0. To this end, let us define 

log(l + ae _1 ) 
° £ := log(l + e-i) ' 

where a > 0, so that ||aj|| e = Y17=i \ x i\s- ^ is easy to check that ||x||o = lim £ _,.o \\ x \\e 
and ||a;||i = lim £ ^ OQ \\x\\ £ . In addition, we have a > a £l > a £2 > . . . > 1 for a > 1 and 
1 > . . . > a £2 > a £l > a for < a < 1, if E\ > Si > . . ., i.e., for any a > and any 
< e < oo, the value a £ is closer to 1 than a is to 1. This means a £ for any < e < oo is a 
better approximation to l{ a ^o} than a is to l{ a ^o}- Therefore, ||aj|| e for any < e < oo is 
a better approximation to ||a;||o than ||a;||i is to \\x\\q. 
Let us define 

Q(x) := x T Ax — p||ic||o, 

n 

Q £ {x) := x T Ax - p £ s ^\og(l + \xi\e' 1 ) 

i=i 

and Q := {x : x T Bx < 1}. Note that Q(x) = lim £ ^o Qe( x ) f° r any fixed x, i.e., Q £ 
converges pointwise to Q. So, the sparse GEV problem is obtained as max{lim e ^o Qe( x ) '■ 
x £ £1}, while the approximate problem is given by max{Q £ (x) : x £ (]}. Suppose that x 
denotes a maximizer of Q(x) over Q and x £ denotes a maximizer of Q £ (x) over 0,. Now, one 
would like to know how good is the approximate solution, x £ compared to x. In general, it 
is not straightforward to either bound ||x e — x|| in terms of e or show that ||a3 e — S|| — > as 
e — > because Q(x) may be quite flat near its maximum over f2. At least, one would like 
to know whether Q £ {x £ ) — > Q(x) as e — ► 0, i.e., 

? 

limmaxQpfa;) = maxQfa;) = max lim QJx). (11) 

In other words, we would like to know whether the limit process and the maximization over 
Q can be interchanged. It can be shown that if Q £ converges uniformly over f2 to Q, then 
the equality in (11) holds. However, it is easy to see that Q £ does not converge uniformly 
to Q over S7, so nothing can be said about (11). 

D.c. formulation: Let us return to the formulation in (SGEV-A). To solve this continuous, 
non-convex optimization problem and derive an algorithm for the sparse GEV problem, we 
explore its formulation as a d.c. program. D.c. programs are well studied and many 
algorithms exist to solve them (Horst and Thoai, 1999). They are defined as follows. 

Definition 1 (D.c. program) Let VL be a convex set in W 1 . A real valued function f : 
Q — > R is called a d.c. function on Q, if there exist two convex functions g,h : 17 — > R such 
that f can be expressed in the form f(x) = g(x) — h(x), s 6 fi. Optimization problems of 
the form min{/o(a;) : x £ fl, fi{x) < 0, i = 1, . . . , m}, where fi = gi — hi, i = 0, . . . , m, are 
d.c. functions are called d.c. programs. 
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To formulate (SGEV-A) as a d.c. program, let us choose r£l such that A + rl n >z 0. If 
A y 0, such r exists trivially (choose r > 0). If A is indefinite, choosing r > —\ m i n {A) 
ensures that A + rl n >z 0. Therefore, choosing r > max(0, — A m j n (A)) ensures that A + 
Tl n h for any A G S n . (SGEV-A) is equivalently written as 

n 

min Ml 35 !!! ~~ x T (A + Tl n )x\ + p £ log (la; J + e) 

£C J * ' 

1=1 

s.t. a: T B;E < 1. (12) 
Introducing the auxiliary variable, y, yields the equivalent program 

n 

x T (A + Tl n )x - pe^2 log(y» + e) 

i=l 

s.t. x' J 5a; < 1, — y ^ x ^ y, (13) 

which is a d.c. program. Indeed, the term 7" || a; 1 1 1 is convex in x as r > and, by con- 
struction, x T (A + rl n )x — p £ Y^ii=i ^°s(yi + e ) is j° m tly convex in x and y. So, the above 
program is a minimization of the difference of two convex functions over a convex set. 
Global optimization methods like branch and bound, and cutting planes can be used to 
solve d.c. programs (Horst and Thoai, 1999), but are not scalable to large-scale prob- 
lems. Since (13) is a constrained nonlinear optimization problem, it can be solved by, e.g., 
sequential quadratic programming, augmented Lagrangian methods or reduced-gradient 
methods (Bonnans et al., 2006). In the following sections, we present an iterative algorithm 
to solve (13) using the majorization-minimization method. 

3.2 Majorization-minimization method 

The majorization-minimization (MM) method can be thought of as a generalization of the 
well-known expectation-maximization (EM) algorithm (Dempster et al., 1977). The general 
principle behind MM algorithms was first enunciated by the numerical analysts Ortega 
and Rheinboldt (1970) in the context of line search methods. The MM principle appears 
in many places in statistical computation, including multidimensional scaling (deLeeuw, 
1977), robust regression (Huber, 1981), correspondence analysis (Heiser, 1987), variable 
selection (Hunter and Li, 2005), sparse signal recovery (Candes et al., 2007), etc. We refer 
the interested reader to a tutorial on MM algorithms (Hunter and Lange, 2004) and the 
references therein. 

The general idea of MM algorithms is as follows. Suppose we want to minimize / over 
C M n . The idea is to construct a majorization function g over JlxSl such that 

f(x) < g(x, y), Vx, y £ and f(x) = g(x,x), Vx G fl. (14) 

Thus, g as a function of x is an upper bound on / and coincides with / at y. The 
majorization-minimization algorithm corresponding to this majorization function g updates 
x at iteration I by 

x^ +1 ) G argmin#(x,2^), (15) 



min "7" 1 1 35 1 1 2 — 
x,y 



9 



unless we already have 

€ argmin g(x, x^), 

in which case the algorithm stops. The majorization function, g is usually constructed by 
using Jensen's inequality for convex functions, the first-order Taylor approximation or the 
quadratic upper bound principle (Bohning and Lindsay, 1988). However, any other method 
can be used to construct g as long as it satisfies (14). It is easy to show that the above 
iterative scheme decreases the value of / monotonically in each iteration, i.e., 

f{x (l+1) ) < g{xQ +1 \ X V) < g(x®,x®) = f(x®), (16) 

where the first inequality and the last equality follow from (14) while the sandwiched in- 
equality follows from (15). 

Note that MM algorithms can be applied equally well to the maximization of / by 
simply reversing the inequality sign in (14) and changing the "min" to "max" in (15). 
In this case, the word MM refers to minorization-maximization, where the function g is 
called the minorization function. To put things in perspective, the EM algorithm can be 
obtained by constructing the minorization function g using Jensen's inequality for concave 
functions. The construction of such g is referred to as the E-step, while (15) with the "min" 
replaced by "max" is referred to as the M-step. The algorithm in (14) and (15) is used in 
machine learning, e.g., for non-negative matrix factorization (Lee and Seung, 2001), under 
the name auxiliary function method. Lange et al. (2000) studied this algorithm under the 
name optimization transfer while Meng (2000) referred to it as the SM algorithm, where 
"S" stands for the surrogate step (same as the majorization/minorization step) and "M" 
stands for the minimization/maximization step depending on the problem at hand, g is 
called the surrogate function. In the following, we consider an example that is relevant 
to our problem where we construct a majorization function, g, which will later be used in 
deriving the sparse GEV algorithm. 

Example 1 (Linear Majorization) Let us consider the optimization problem, min^n f(x) 
where f = u — v, with u and v both convex, and v continuously differentiable. Since v is 
convex, we have v(x) > v(y) + (x — y) T 'Vv(y), y x,y £ Q. Therefore, 

f(x) < u(x) - v(y) - (x- y) T X7v(y) =: g(x, y). (17) 

It is easy to verify that g is a majorization function of f . Therefore, we have 

x^ 1 ^ G argmin g(x,x^) = argmin u(x) — x T Vv(x^). (18) 

If O, is a convex set, then the above procedure solves a sequence of convex programs. Note 
that the same idea is used in the concave- convex procedure ( CCCP) (Yuille and Rangarajan, 
2003). 

Suppose u and v are strictly convex, then a strict descent can be achieved in (16) unless 

3.(1+1) = x (l) > Le _ ; if x (l+l) jL X {1) ) then 

/(as( ,+1 >) < g(x« +1 \x®) < g(x®,x®) = f(x®). (19) 

The first strict inequality follows from (17), a strict inequality for strictly convex v. Since 
u is strictly convex, g is strictly convex and therefore g(x^ l+1 \x^) < g(x^\x^) unless 
g;(H-i) = x (i) _ This strictly monotonic descent property will be helpful to analyze the con- 
vergence of the sparse GEV algorithm that is presented in the following section. 
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3.3 Sparse GEV algorithm 

Let us return to the approximate sparse GEV program in (12). Let 

n 

f( X ) = T\\X\\1 +PsY, lQ g( £ + N) - xT ( A + Tl n)x, (20) 
i=l 

where r > max(0, — X m i n (A)) so that (12) can be written as mm x ^Q f (x) and $1 = {x : 
x T Bx < 1}. The main idea in deriving the sparse GEV algorithm is in obtaining a 
majorization function, g that satisfies (14) and then using it in (15). The following result 
provides such a function g for / in (20). 

Proposition 2 The following function 

g(x, y) = t\\x\\1 - 2x T (A + rl n )y + y T {A + rl n )y + p £ V log(e + \ Vl \) + p £ Y] 7 ~ V l \ 

(21) 

majorizes f in (20). 

Proof Consider the term log(e + \xi\) in /. Using the inequality log(z) < z — 1, Vz G M+ 
with z = tN^, we have 

log(g + Ixil) <log( £ + | y< |)+ ^ ' "J^ 1 , Vg,y. (22) 

On the other hand, since A + tJ„ >z 0, by Example 1 with u(x) = t - 1 1 a? 1 1 2 and v(x) = 
x T (A + rl n )x, we have 

r||aj||l - x T (A + Tl n )x < t\\x\\ 2 2 - y T (A + rl n )y - 2(x - yf(A + rl n )y, Vaj, y. (23) 

From (22) and (23), it is easy to check that g in (21) majorizes / over R n x M. n and therefore 
over Q x Q, where Q = {x : x T Bx < 1}. ■ 



By following the minimization step in (15) with g as in (21), the sparse GEV algorithm 
is obtained as 

x (i+i) = argmin J T \\ x \\2 _ 2a; T (A + Tl n )x {l) + p e V — JpJ — : x r Bx < 1 1 , (ALG) 
" I ttK ] \+e j 

which is a sequence of quadratically constrained quadratic programs (QCQPs) (Boyd and 
Vandenberghe, 2004). It is clear that x^ l+1 ^ is the unique optimal solution of (ALG) ir- 
respective of whether r is zero or not. 3 (ALG) can also be obtained by applying linear 
majorization (see Example 1) to (13). See Appendix B for details. 

3. Suppose The objective function in (ALG) is strictly convex in x and therefore *< ;+1 > is the unique 

optimal solution. When r = 0, the objective function is linear in x and the unique optimum lies on the 
boundary of the constraint set. 
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Assuming t / and denning wp := — py — , := . . . , w$) and := 

D(w^), a diagonal matrix with as its principal diagonal, (ALG) reduces to 



jjC+i) _ ar g m j n 



x - (t 1 A + /„) 



(0 



2 + * 

2 r 



W (0 



a; 



s.t. x T Bx < 1. (24) 

(24) is very similar to LASSO (Tibshirani, 1996) except for the weighted £i-penalty and the 
quadratic constraint. When is chosen such that x^ = al, then the first iteration of 
(24) is a LASSO minimization problem except for the quadratic constraint. Let us analyze 
(24) to get an intuitive interpretation. 

(a) p £ = p = 0: (24) reduces to min{||a: — s^||| • xT Bx < 1}, where 8® = (r~ l A + 
I n )x (l \ So, if G {a; : cc T Sa: < 1}, then a^ +1 ) = gW, else aj( ,+1 ) = (J n + 
^)B)- l s^ l \ where satisfies [s«]r(j n+/U (/+i)_ B )-i S (/ ri+/i (i+i)_ B )-i s (0 = \ 

The first term in the objective of (24) computes the best approximation to in the 
^2-norm so that the approximation lies in the ellipsoid x T Bx < 1. We show in 
Corollary 6 that the iterative algorithm in (24) with p £ = converges to the solution 
of (GEV-P) and therefore, the solution x is non-sparse. 

(b) p £ = p = oo: In this case, (24) reduces to min {||WWaj||i : x T Bx < 1}, which is a 
weighted ^i-norm minimization problem. Intuitively, it is clear that if x^p is small, 
its weighting factor, wp = (|x^| +e) _1 in the next minimization step is large, which 

therefore pushes xf +1 ^ to be small. This way the small entries in x are generally 
pushed toward zero as far as the constraints on x allow, therefore yielding a sparse 
solution. 

From the above discussion, it is clear that (24) is a trade-off between the solution to the 
GEV problem and the solution to the weighted ^i-norm problem. From now on, we refer 
to (ALG) as the Sparse GEV algorithm, which is detailed in Algorithm 1. 

To run Algorithm 1, p, r and e need to be chosen. In a supervised learning setup like 
FDA, p can be chosen by cross-validation whereas, in an unsupervised setup like PCA/CCA, 
Algorithm 1 has to be solved for various p and the solution with desired cardinality is 
selected. Since p is a free parameter, r and e can be set to any value (that satisfies the 
constraints in Algorithm 1) and p can be tuned to obtain the desired sparsity as mentioned 
above. However, it has to be noted that for a fixed value of p, increasing r or e reduces 
sparsity. 4 So, in practice r is chosen to be max(0, — A m j n (A)), e to be close to zero and p 
is set by searching for a value that provides the desired sparsity. 

Suppose that Algorithm 1 outputs a solution, x* such that ||x*||o = k. Can we say 
that x* is the optimal solution of (SGEV-P) among all x with cardinality fc? The following 
proposition provides a condition to check for the non-optimality of x*. In addition, it 



4. Increasing e increases the approximation error between ||x|| and ^"=1 '"figti+g- 1 ) ^ anc ^ therefore re- 
duces sparsity. From (24), it is clear that increasing r reduces the weight on the term ||W^x||i, which 
means more importance is given to reducing the approximation error, \\x — (r~ 1 A + I n )x^ |||, leading 
to a less sparse solution. 
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Algorithm 1 Sparse Generalized Eigenvalue Algorithm 



Require: A € B y 0, e > and p > 

1: Choose r > max(0, — X m i n (A)) 
Choose aj(°) G {x : x T Bx < 1} 

Set ^ = logCl+e- 1 ) 

if r = then 



repeat 

.(0 - 



w 



„(0i 



WW = D(ti>«) 



a; 



(i+i) 



arg max 

S.t. 



2 



i T Ba; < 1. 



x 



(25) 



until convergence 
else 
repeat 

™? ] = (|xf ) | + e)- 1 
WW = £>(k; w ) 



a; 



(i+i) 



arg mm 

X 

S.t. 



X 



(r 1 A + J n )a: 



(0 



x 1 Bx < 1. 



2 r 



a; 



(26) 



until convergence 
end if 
return 



also presents a post-processing step (called variational renormalization) that improves the 
performance of Algorithm 1. See Moghaddam et al. (2007, Proposition 2) for a similar 
result in the case of A y and B = I n . 

Proposition 3 Suppose Algorithm 1 converges to a solution x* such that ||x*||o = k. 
Let z be the sub-vector of x* (obtained by removing the zero entries of x* ) and = 
arg max{ x T Af c x : x T B^x = 1}, where A& and B^ are submatrices of A and B defined by 
the same non-zero indices of x* . If z ^ u^, then x* is not the optimal solution of (SGEV- 
P) among all x with the same sparsity pattern as x* (and therefore, is not the optimal 
solution of (SGEV-P) among all x with \\x\\q = k). Nevertheless, by replacing the non-zero 
entries of x* with those of u^, the value of the objective function in (SGEV-P) increases 
from [x*] T Ax* to \(Ak, Bk), its optimal value among all x with the same sparsity pattern 
as x* . 

Proof Assume that x*, the solution output by Algorithm 1, is the optimal solution of 
(SGEV-P). Define v such that V{ = l{| 2 .*|^o}- Since x* is the optimal solution of (SGEV- 
P), we have x* = org m&x{y T D(v) AD (v)y : y T D(v)BD(v)y = 1}, which is equivalent 
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to z = arg max{w T A k w : w T B k w = 1} = u k and the result follows. Note that X(Ak, B k ) 
is the optimal value of (SGEV-P) among all x with the same sparsity pattern as x*. There- 
fore, if z = u k , then [x*] T Ax* = X(A k , B k ). ■ 

The variational renormalization suggests that given a solution (in our case, x* at the termi- 
nation of Algorithm 1), it is almost certainly better to discard the loadings, keep only the 
sparsity pattern and solve the smaller unconstrained subproblem to obtain the final load- 
ings, given the sparsity pattern. This procedure surely improves any continuous algorithm's 
performance. 

In Algorithm 1, we mention that the iterative scheme is continued until convergence. 
What does convergence mean here? Does the algorithm really converge? If it converges, 
what does it converge to? Does it converge to an optimal solution of (SGEV-A)? To address 
these questions, in the following section, we provide the convergence analysis of Algorithm 1 
using tools from global convergence theory (Zangwill, 1969). 

3.4 Convergence analysis 

For an iterative procedure like Algorithm 1 to be useful, it must converge to point solu- 
tions from all or at least a significant number of initialization states and not exhibit other 
nonlinear system behaviors, such as divergence or oscillation. Global convergence theory of 
iterative algorithms (Zangwill, 1969) can be used to investigate this behavior. We mention 
up front that this does not deal with proving convergence to a global optimum. Using this 
theory, recently, Sriperumbudur and Lanckriet (2009) analyzed the convergence behavior 
of the iterative algorithm in (18) and showed that under certain conditions on u and v, 
the algorithm in (18) is globally convergent, i.e., for any random initialization, x^\ the 
sequence of iterates {x^}fl converges to some stationary point 5 of the d.c. program, 
minjcgpX^a:) — v i x ))- Since (ALG) can be obtained by applying linear majorization to 
(13), as shown in Appendix B, the convergence analysis of (ALG) can be carried out (see 
Theorem 5) by invoking the results in Sriperumbudur and Lanckriet (2009). We show in 
Theorem 5 that Algorithm 1 is globally convergent. In Corollary 6, we then show that 
Algorithm 1 converges to the solution of the GEV problem in (GEV-P) when p = 0. In the 
following, we introduce some notation and terminology and then proceed with the derivation 
of the above mentioned results. 

The convergence analysis of an iterative procedure like Algorithm 1 uses the notion 
of a set-valued mapping, or point-to-set mapping, which is central to the theory of global 
convergence. A point-to-set map ^ from a set X into a set Y is defined as ^ : X — > &(Y), 
which assigns a subset of Y with each point of X, where &(Y) denotes the power set of Y. 
^ is said to be uniformly compact on X if there exists a compact set H independent of x 
such that fy(x) C H for all x £ X. Note that if X is compact, then iff is uniformly compact 
on X. A fixed point of the map iff : X — ► &(X) is a point x for which {x} = iff(x). 

Many iterative algorithms in mathematical programming can be described using the 
notion of point-to-set maps. Let A be a set and xq G X a given point. Then an algorithm, 
A, with initial point xq is a point-to-set map A : X — ► g?{X) which generates a sequence 

5. x* is said to be a stationary point of a constrained optimization problem if it satisfies the correspond- 
ing Karush-Kuhn- Tucker (KKT) conditions (Bonnans et al., 2006, Section 13.3). Assuming constraint 
qualification, KKT conditions are necessary for the local optimality of x*. 
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{xk}^ =1 via the rule xt+i £ A{xk), k = 0, 1, .A is said to be globally convergent if for any 

chosen initial point xq, the sequence {xk} ( ^Lo generated by xt+i € A(xk) (or a subsequence) 
converges to a point for which a necessary condition of optimality holds: the Karush-Kuhn- 
Tucker (KKT) conditions in the case of constrained optimization and stationarity in the 
case of unconstrained optimization. The property of global convergence expresses, in a 
sense, the certainty that the algorithm works. It is very important to stress the fact that it 
does not imply (contrary to what the term might suggest) convergence to a global optimum 
for all initial points xq- 

We now state the convergence result for (18) by Sriperumbudur and Lanckriet (2009), 
using which we provide the convergence result for (ALG). 

Theorem 4 (Sriperumbudur and Lanckriet (2009)) Consider the program, 

min{u(x) — v(x) : x £ £1}, (DC) 

where £1 = {x : q(x) < 0, i £ [m], dj(x) = 0, j G [p]} and [m] := {1, . . . ,m}. Let u and v be 
strictly convex, differentiable functions defined on W 1 . Also assume Vt> is continuous. Let 
{ci} be differentiable convex functions and {dj} be affine functions on W 1 . Suppose (DC) 
is solved iteratively as G Ad c {x^), where Ad c is the point-to-set map defined as 

Adc(y) = argmin u(x) — x T v(y). (DC-ALG) 

Let {x^}fZ be any sequence generated by Ad c defined by (DC-ALG). Suppose Ad c is uni- 
formly compact on Q and Ad c {x) is nonempty for any x E CI. Then, assuming suitable 
constraint qualification, all the limit points of {x^}f2. are stationary points of the d.c. pro- 
gram in (DC), u(x®) - v(x®) -► u(xA - v(xA =: /* as I ->■ oo, for some stationary point 
— x^\\ — ► 0, and either {x®}f2_ converges or the set of limit points of {x^}f2. 
is a connected and compact subset ofS fi (f*), where -5^(a) := {x G y : u(x) — v(x) = a} and 
y is the set of stationary points of (DC). If y(f*) is finite, then any sequence {x^}f^ 
generated by Ad c converges to some in S"(f*). 

The following global convergence theorem for (ALG) is obtained by simply invoking The- 
orem 4 with ft = {(x,y) : x T Bx < 1, — y < x < y}, u(x) = t - || a? || § and v(x,y) = 
x T {A + Tl n )x - p £ YJU log(yi + e). 

Theorem 5 (Global convergence of sparse GEV algorithm) Let{x^}fZ be any se- 
quence generated by the sparse GEV algorithm in Algorithm 1. Then, all the limit points of 
o are stationary points of the program in (SGEV-A), 

n n 

p e J>g(e + \xf ] \) - [x^fAxW -+ p £ J>g( £ + |<|) - [x*] T Ax* := L* , (27) 

i=i i=i 

for some stationary point x* , \\x^ l+1 ^ —x^ || — > 0, and either {x^}f^ converges or the set of 
limit points of {x^}f^. Q is a connected and compact subset of y{L*), where <5^{a) := {x € 
y : x T Ax — p £ J27=i l°g( e +l x j|) = — a l and y is the set of stationary points of (SGEV-A). 
If y{L*) is finite, then any sequence {x^}fl Q generated by Algorithm 1 converges to some 
x* in y(L*). 
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Proof As noted before, (ALG) can be obtained by applying linear majorization to (13), 
which is equivalent to (SGEV-A), with Q = {(x,y) : x T Bx < 1, —y < x ■< y}, u(x) = 
-7- 1 1 as 1 1 2 and v(x, y) = x T (A + rl n )x — p £ J27=i l°g(yi + £ )- It is easy to check that u and 
v satisfy the conditions of Theorem 4. Since Algorithm 1 and (ALG) are equivalent, let 
A correspond to the point-to-set map in (ALG). Clearly {x : x T Bx < 1} is compact and 
therefore A is uniformly compact. By the Weierstrass theorem 6 (Minoux, 1986), it is clear 
that A(x) is nonempty for any x € {x : x T Bx < 1}. The result therefore follows from 
Theorem 4. ■ 

Having considered the convergence of Algorithm 1, we now consider the convergence of 
special cases of Algorithm 1. Note that p = implies p £ = 0. Using this in (SGEV-A) 
yields the GEV problem in (GEV-P). Since Algorithm 1 is derived based on (SGEV-A), 
it would be of interest to know whether the sequence {x^}f2,Q generated by Algorithm 1 
for p = converges to the solution of (GEV-P). The following corollary answers this and 
shows that when p = 0, the solution of the sparse GEV algorithm (Algorithm 1) matches 
with that of the GEV problem in (GEV-P). 

Corollary 6 Let p = and \ max (A, B) > 0. 7 Then, any sequence {x^}fl generated by 
Algorithm 1 converges to some x* such that X max (A, B) = [x*] T Ax* and [x*] T Bx* = 1. 

Proof The stationary points of (SGEV-A) with p £ = p = are the generalized eigenvectors 
of (A, B). Therefore, the set 5? as defined in Theorem 5 is finite and any sequence {x^}f^ Q 
generated by Algorithm 1 converges to some x* in y{L*) where L* = — [x*] T Ax* . We need 
to show that L* = — X max (A, B). Note that x* is a fixed point of Algorithm 1. Consider 
(ALG) which is equivalent to Algorithm 1. With p £ = 0, solving the Lagrangian yields 
3.(2+1) = (^('+ 1 )_B + tI h )~ 1 (A + Tl n )x^ l \ where p( l+1 ^ > is the Lagrangian dual variable 
for the constraint [a^ +1 )] T iW +1 ) < 1. At the fixed point, x* , we have (p*B + rl n )x* = 
(A + rl n )x* which implies 

Ax*=p*Bx*. (28) 
Multiplying both sides of (28) by [x*] T ', we have 

[x*] T Ax* = p*[x*] T Bx* = p*([x*] T Bx* - 1) +p* 

= M*j (29) 

where we have invoked the complementary slackness condition, p*([x*] T Bx* — 1) = 0. The 
optimum value of (ALG) at the fixed point is given by V* := -2[x*] T Ax* — 7"||a;*|||, which 
by (29) reduces to ip* = —2p* — t\\x*\\2- It is easy to see that making p* > 0, and therefore 
[x*] T Bx* = 1 minimizes ip* instead of choosing p* = and [x*] T Bx* < 1. Since tp* is 
minimized by choosing the maximum p* that satisfies (28), (p*, x*) is indeed the eigen pair 
that satisfies the GEV problem in (GEV-P). ■ 

In addition to p = 0, suppose A >z and r = 0. Then, the following result shows that a 
simple iterative algorithm can be obtained to compute the generalized eigenvector associated 
with X max (A,B). 



6. The Weierstrass theorem states: If / is a real continuous function on a compact set if C R™, then the 
problem min{/(a;) : x G K} has an optimal solution x* G K. 

7. If \ max (A, B) > 0, then umx{x T Ax : x T Bx < 1} = ma,x{x T Ax : x T Bx = 1} = X max (A, B). 
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Algorithm 2 Sparse PCA algorithm (DC-PCA) 
Require: A y 0, e > and p > 

1: Choose x (0) G {a; : x T x < 1} 

2: Set ^ = log(lt-i) 

3: repeat 

4: «,« = (| x f)| +e )-l 
5: 







+ sign((A£cW)i) 




/eiu 


|(Ax«), 


- $v>F 


2 
+ 



6: until convergence 
7: return x^' 



Corollary 7 Zei A ^ 0, r = and p = 0. T/ien, ant/ sequence {x^}^ generated by the 
following algorithm 

x^ = B ~ 1AX(1) (30) 

converges to some x* such that X max (A, B) = [x*] T Ax* and [x*] T Bx* = 1. 

Proof Consider (ALG) with r = and p £ = p = 0. Since the objective is linear in x, the 
minimum occurs at the boundary of the constraint set, i.e., {x : x T Bx = 1}. Solving the 
Lagrangian, we get (30). The result therefore follows from Corollary 6 which holds for any 
r > 0. ■ 

So far, we have proposed a sparse GEV algorithm and proved its global convergence 
behavior. In the following sections (Sections 4-6), we consider specific instances of the 
sparse GEV problem and use the proposed algorithm (Algorithm 1) to address them. 

4. Sparse Principal Component Analysis 

In this section, we consider sparse PCA as a special case of the sparse GEV algorithm that 
we presented in Section 3.3. Based on the sparse GEV algorithm in Algorithm 1, we propose 
a sparse PCA algorithm, called DC-PCA, with A y being the covariance matrix, B = I n 
and r = 0. In this setting, (ALG) reduces to a very simple iterative rule, which is proved 
in Appendix C: 

"KAx^l-^l sign((Ax«).) 

, Vi, (ALG-S) 



J'+l) - 


\(Ax(% 




+ sign((Ax«) i ) 


I 


/E-=i 


|(Ax«) 


H 2 W i 


2 
+ 



where [a] + := max(0, a). The corresponding sparse PCA algorithm (DC-PCA) is shown 
in Algorithm 2. Note that the computation of x^ l+1 \ from x®, involves computing Ax®, 
which has a complexity of 0(n 2 ). Therefore, the DC-PCA algorithm has a per iteration 
complexity of 0(n 2 ). Since Algorithm 1 exhibits the global convergence behavior and DC- 
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PCA is a special case of Algorithm 1, it follows that DC-PCA exhibits the global convergence 
property. 

Suppose p £ = p = 0. Then, with A >z and B = I n , (SGEV-R) reduces to: 

max{a; T Ax : x T x = 1}, (EV-P) 

i.e., a standard eigenvalue problem. Therefore, it is of interest to know whether Algorithm 2 
provides a solution of (EV-P) when p = 0. It follows from Corollary 7 that DC-PCA 
converges to a solution of (EV-P) when p = 0. In addition, the following result shows that 
DC-PCA reduces to the power method for computing X max (A) when p = 0. 

Proposition 8 (Power method) Suppose p = 0. Then, Algorithm 2 is the power method 
for computing X max (A). 

Proof Setting p = in Algorithm 2 yields 

(m) _ AxU 

" \\Ax<n\\ 2 ' {61) 

which is the power iteration for the computation of X max (A). ■ 

Before proceeding further, we briefly discuss the prior work on sparse PCA algorithms. 
The earliest attempts at "sparsifying" PCA consisted of simple axis rotations and com- 
ponent thresholding (Cadima and Jolliffe, 1995) for subset selection, often based on the 
identification of principal variables (McCabe, 1984). The first true computational tech- 
nique, called SCoTLASS (Jolliffe et al., 2003), provided an optimization framework us- 
ing LASSO (Tibshirani, 1996) by enforcing a sparsity constraint on the PCA solution by 
bounding its ^i-norm, leading to a non-convex procedure. Zou et al. (2006) proposed a 
^i-penalized regression algorithm for PCA (called SPCA) using an elastic net (Zou and 
Hastie, 2005) and solved it very efficiently using least angle regression (Efron et al., 2004). 
Subsequently, d'Aspremont et al. (2007) proposed a convex relaxation to the non-convex 
cardinality constraint for PCA (called DSPCA) leading to a SDP with a complexity of 
0(n 4 v / kigre). Although this method shows performance comparable to SPCA on a small- 
scale benchmark data set, it is not scalable for high-dimensional data sets, even possibly 
with Nesterov's first-order method (Nesterov, 2005). Moghaddam et al. (2007) proposed a 
combinatorial optimization algorithm (called GSPCA) using greedy search and branch-and- 
bound methods to solve the sparse PCA problem, leading to a total complexity of 0(n 4 ) 
for a full set of solutions (one for each target sparsity between 1 and n). d'Aspremont 
et al. (2008) formulated a new SDP relaxation to the sparse PCA problem and derived a 
more efficient greedy algorithm (compared to GSPCA) for computing a full set of solutions 
at a total numerical complexity of 0(n 3 ), which is based on the convexity of the largest 
eigenvalue of a symmetric matrix. Recently, Journee et al. (2008) proposed a simple, iter- 
ative and very efficient sparse PCA algorithm (GPower^ ) with a per iteration complexity 
of 0(n 2 ), which is based on the idea of linear majorization. They showed that it performs 
similar to many of the above mentioned algorithms. Therefore, to our knowledge, GPower^ 
is the state-of-the-art. 

In the following, we discuss how DC-PCA relates to SCoTLASS (Jolliffe et al., 2003), 
SPCA (Zou et al., 2006) and GPower^ (Journee et al., 2008). We then present experiments 
to empirically compare different approaches. 
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4.1 Comparison to SCoTLASS 

As mentioned before, the SCoTLASS program is obtained by approximating ||x||o with 
|| as || i in (SGEV-P) and is given by 

max x 1 Ax 

X 

s.t. ||x||| = 1, || as || x < k, (33) 
where A >z 0. Let us consider the regularized version of the above program, given by 

max x 1 Ax — p\\x\\i 

X 

s.t. ||z||| < 1. (34) 

It is clear that (34) is not a canonical convex program because of the convex maximization. 
Applying the MM algorithm to (34), we obtain the following iterative algorithm: 

x^ l+V> = argmax x T Ax® - §r\\x\\i 
x 2 

s.t. ||z||| < 1. (35) 
Using an approach as in Appendix C, (35) can be solved in closed form as 

(, + i) _ [\(Ax(%\-§] + S ign((Ax(% 



Vn=i[|(A= (,) )i|-§]: 



, Vi. (36) 



Now, let us compare (36) with the DC-PCA iteration in Algorithm 2. For ScoTLASS, 
xf +1) = if \(Ax®)i\ < f , whereas for DC-PCA, xf +1) = if |(Ax«);| < f wf\ This 

means that if = for some I, then DC-PCA ensures that x\ m ^ = 0, Vm > / which is not 
guaranteed for SCoTLASS. Therefore, DC-PCA ensures faster convergence of an irrelevant 
feature to zero than SCoTLASS, thus providing better sparsity. This is not surprising as 
a better approximation to the cardinality constraint is used in DC-PCA. It can be shown 
that when p = 0, like DC-PCA, SCoTLASS also reduces to the power iteration algorithm 
in (32). 



4.2 Comparison to SPCA 

Let Q be a r x n matrix, where r and n are the number of observations and the number of 
variables respectively, with the column means being zero. Suppose Q has an SVD given by 
Q = UAV T , where U contains the principal components of unit length and the columns 
of V are the corresponding loadings of the principal components. Let yi = [t/A]j, Vi. 
Zou et al. (2006, Theorem 1) posed PCA as a regression problem and showed that [V]j = 
x*/||sc*||2, where 

x* = argmin \\yi — Qx\\\ + A||x|||, (37) 

X 

where A > 0. This is equivalent to solving for an eigenvector of Q T Q =: A. Therefore, 
solving for the eigenvectors of a positive semidefinite matrix is posed as a ridge regression 
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problem in (37). To solve for sparse eigenvectors, Zou et al. (2006) introduced an £i-penalty 
term in (37) resulting in the following elastic net called SPCA, 

x' = argmin — Qx\\\ + A||x||| + Ai||x||i, (38) 

X 

where Ai > 0. This problem can be interpreted in a Bayesian setting as follows: given the 
likelihood on yi, yi\x,a 2 ~ Q{Qx,a 2 1), which is a circular normal random variable with 
mean Qx (conditioned on x), and a prior distribution on x, x\{3 2 , 7 ~ G(0, (3 2 I) exp(— j\xi 
which is the product of a circular Gaussian and a product of Laplacian densities, compute 
the maximum a posteriori (MAP) estimate of x. It is easy to see that the penalization 
parameters A and Ai in (38) are related to a 2 , f3 2 and 7. As aforementioned, our approach 
can be interpreted as defining an improper prior over x, which promotes sparsity (Tip- 
ping, 2001). We use p(x) oc f\i | a -}+ £ (instead of FJ i exp(— 7|xj|)) as the prior such that 
x\(3 2 ,e ~ Q(0, (3 2 I)p(x). Replacing the prior in (38) with our prior, results in 

min||j/j - Qx\\l + X\\x\\l + Ai^log(|xi| + e). (39) 

i 

Since the problem in (39) is equivalent to (SGEV-A) with B = I n , it is clear that DC-PCA 
can be expected to provide sparser solutions than SPCA because of the prior p(x) that 
promotes sparsity. It is to be noted that the SPCA framework is not extendible to other 
settings like FDA or CCA unlike our formulation which is generic. 

4.3 Comparison to GPowerf 

Consider the following regularized sparse PCA program where A y 0: 

max{ x T Ax — P\\x\\q : x T x < 1}. (40) 

In our case, we approximated ||a:||o by ||a;|| £ and posed the approximate program as a d.c. 
program, resulting in DC-PCA obtained through majorization-minimization. On the other 
hand, without using any approximations to ||a:||o, Journee et al. (2008) showed that the 
solution to (40) can be obtained as: 



[sign((cf z) 2 - p)] 



T 

cj Z 



xi= , L — - J + t ^ = ,y l , (41) 

y/TZ=i [sign((cfzy-p)] + (cfzy 

where A = C T C, C is a p x n matrix, q is the i th column of C and z is the solution to 
the following program which is the maximization of a convex function over an ellipsoid: 



n 



max 



^[(cfz) 2 -p] + : = 1, z e VP . (42) 
U=i J 
(42) is then solved iteratively (called GPower^J using a simple gradient-type scheme result- 
ing in the following update rule: 



11 

z^) = £ [sign((cf z«)2 _ ~ p) 



i=i 

,(/+!) 
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Table 1: Loadings for first three sparse principal components (PCs) of the pit props data. 

The SPCA and DSPCA loadings are taken from Zou et al. (2006) and d'Aspremont 
et al. (2007) respectively. 
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It can be shown that their gradient-type scheme is equivalent to an MM-type algorithm. 
Therefore, our method differs from GPower£ only in a small way. This is also confirmed 
empirically in the following subsection where DC-PCA and GPower^ exhibit similar per- 
formance. We would like to mention that unlike our sparse generalized eigenvalue algorithm 
(Algorithm 1), GPower£ cannot be readily extended to settings like FDA and CCA. 

4.4 Experimental results 

In this section, we illustrate the effectiveness of DC-PCA in terms of sparsity and scalability 
on various datasets. On small datasets, the performance of DC-PCA is compared against 
SPCA, DSPCA, GSPCA and GPower£ , while on large datasets, DC-PCA is compared to 
all these algorithms except DSPCA and GSPCA due to scalability issues. Since GPower^ 
has been compared to the greedy algorithm of d'Aspremont et al. (2008) by Journee et al. 
(2008), wherein it is shown that these two algorithms perform similarly except for the greedy 
algorithm being computationally more complex, we do not include the greedy algorithm in 
our comparison. The results show that the performance of DC-PCA is comparable to the 
performance of many of these algorithms, but with better scalability. The experiments in this 
paper are carried out on a Linux 3 GHz, 4 GB RAM workstation. On the implementation 
side, we fix e to be the machine precision in all our experiments, which is motivated from 
the discussion in Section 3.1. 

4.4.1 Pit props data 

The pit props dataset (Jeffers, 1967) has become a standard benchmark example to test 
sparse PCA algorithms. The first 6 principal components (PCs) capture 87% of the total 
variance. Therefore, the explanatory power of sparse PCA methods is often compared on 
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the first 6 sparse PCs. Table 1 shows the first 3 sparse PCs and their loadings for SPCA, 
DSPCA, GSPCA, GPower A) and DC-PCA. Using the first 6 sparse PCs, SPCA captures 
75.8% of the variance with a cardinality pattern of (7,4,4,1,1,1), which indicates the 
number of non-zero loadings for the first to the sixth sparse PC, respectively. This results in 
a total of 18 non-zero loadings for SPCA, while DSPCA captures 75.5% of the variance with 
a sparsity pattern of (6, 2, 3, 1, 1, 1), totaling 14 non-zero loadings. With a sparsity pattern 
of (6, 2, 2, 1, 1, 1) (total of only 13 non-zero loadings), DC-PCA, GSPCA and GPower^, can 
capture 77.1% of the total variance. Comparing the cumulative variance and cumulative 
cardinality, Figures l(a-b) show that DC-PCA explains more variance with fewer non- 
zero loadings than SPCA and DSPCA. In addition, its performance is similar to that of 
GSPCA and GPower£ . For the first sparse PC, Figure 1(c) shows that DC-PCA consistently 
explains more variance with better sparsity than SPCA, while performing similar to other 
algorithms. Figure 1(d) shows the variation of sparsity and explained variance with respect 
to p for the first sparse PC computed with DC-PCA. This plot summarizes the method for 
setting p: the algorithm is run for various p and the value of p that achieves the desired 
sparsity is selected. 

4.4.2 Random test problems 

In this section, we follow the experimental setup that is considered in Journee et al. (2008). 
Throughout this section, we assume A = C T C, where C is a p x n random matrix whose 
entries are generated according to a Gaussian distribution, with zero mean and unit vari- 
ance. In the following, we present the trade-off curves (proportion of explained variance vs. 
cardinality for the first sparse PC associated with A), computational complexity vs. cardi- 
nality and computational complexity vs. problem size for various sparse PCA algorithms. 

Trade-off curves. Figure 2(a) shows the trade-off between the proportion of explained 
variance and the cardinality for the first sparse PC associated with A for various sparse 
PCA algorithms. For each algorithm, the sparsity inducing parameter (k in the case of 
DSPCA and GSPCA, and the regularization parameter in the case of SPCA, GPower^, 
and DC-PCA) is incrementally increased to obtain the solution vector with cardinality that 
decreases from n to 1. The results displayed in Figure 2(a) are averages of computations on 
100 random matrices with dimensions p = 100 and n = 300. It can be seen from Figure 2(a) 
that DC-PCA performs similar to DSPCA, GSPCA and GPower£ , while performing better 
than SPCA. 

Computational complexity vs. Cardinality. Figure 2(b) shows the average time 
required by the sparse PCA algorithms to extract the first sparse PC of A with p = 100 
and n = 300, for varying cardinality. It is obvious from Figure 2(b) that as the cardinality 
increases, GSPCA tends to get slower while the speed of SPCA, GPower£ and DC-PCA is 
not really affected by the cardinality. We did not show the results of DSPCA in Figure 2(b) 
as its computational complexity is an order of magnitude (around 100 times) more than 

8. The discussion so far dealt with computing the first sparse eigenvector of A. To compute the subsequent 
sparse eigenvectors, usually the sparse PCA algorithm is applied to a sequence of deflated matrices. See 
Mackey (2009) for details. In this paper, we used the orthogonalized Hotelling's deflation as mentioned 
in Mackey (2009). 
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Figure 1: Pit props: (a) cumulative variance and (b) cumulative cardinality for the first 6 
sparse principal components (PCs); (c) proportion of explained variance (PEV) 
vs. cardinality for the first sparse PC (obtained by varying the sparsity parameter 
and computing the cardinality and explained variance for the solution vector) ; (d) 
dependence of sparsity and PEV on p for the first sparse PC computed with DC- 
PCA. 



the scale on the vertical axis of Figure 2(b). Journee et al. (2008) have demonstrated that 
the greedy method proposed by d'Aspremont et al. (2008) also exhibits the behavior of 
increasing computational complexity with the increase in cardinality. 

Computational complexity vs. Problem size. Figure 3 shows the average compu- 
tation time in seconds, required by various sparse PCA algorithms, to extract the first 
sparse PC of A, for various problem sizes, n, where n is increased exponentially and p is 
fixed to 500. The times shown are averages over 100 random instances of A for each prob- 
lem size, where the sparsity inducing parameters are chosen such that the solution vectors 
of these algorithms exhibit comparable cardinality. It is clear from Figure 3 that DC-PCA 
and GPower^ scale better to large-dimensional problems than the other algorithms. Since, 
on average, GSPCA and DSPCA are much slower than the other methods, even for low car- 
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Figure 2: Random test data: (a) (average) proportion of explained variance vs. cardinality 
for the first sparse PC of A; (b) (average) computation time vs. cardinality. In 

(a) , all the sparse PC A algorithms perform similarly and better than SPCA. In 

(b) , the complexity of GSPCA grows significantly with increasing cardinality of 
the solution vector, while the speed of the other methods is almost independent 
of the cardinality. 




10 1 10 Z 10 3 10 4 



n 

Figure 3: Average computation time (seconds) for the first sparse PC of A vs. problem 
size, n, over 100 randomly generated matrices A. 



dinalities (see Figure 2(b)), we discard them from all the following numerical experiments 
that deal with large n. 

For the remaining algorithms, SPCA, GPower^ and DC-PCA, we run another round of 
experiments, now examining the computational complexity with varying n and p but with a 
fixed aspect ratio n/p = 10. The results are depicted in Table 2. Again, the corresponding 
regularization parameters are set in such a way that the solution vectors of these algorithms 
exhibit comparable cardinality. The values displayed in Table 2 correspond to the average 
running times of the algorithms on 100 random instances of A for each problem size. It can 
be seen that our proposed method, DC-PCA, is comparable to GPower^, and faster than 
SPCA. 
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Table 2: Average computation time (in seconds) for the first sparse PC associated with A 
for a fixed regularization parameter. 



p x n 


100 x 1000 


250 x 2500 


500 x 5000 


750 x 7500 


1000 x 10000 


SPCA 


0.135 


1.895 


10.256 


34.367 


87.459 


GPower£ 


0.027 


0.159 


0.310 


1.224 


1.904 


DC-PCA 


0.034 


0.151 


0.301 


1.202 


1.913 



Table 3: Gene expression datasets 



Dataset 


Samples (p) 


Genes (n) 


Reference 


Colon cancer 


62 


2000 


Alon et al. (1999) 


Leukemia 


38 


7129 


Golub et al. (1999) 


Ramaswamy 


127 


16063 


Ramaswamy et al. (2001) 



4.4.3 Gene expression data 

Gene expression data from DNA microarrays provides the expression level of thousands of 
genes across several hundreds or thousands of experiments. To enhance the interpretation 
of these large data sets, sparse PCA algorithms can be applied, to extract sparse principal 
components that involve only a few genes. 

Datasets. Usually, gene expression data is specified by a p x n matrix (say C) of p 
samples and n genes. The covariance matrix, A is therefore computed as C T C. In our 
experiments, we consider three gene expression datasets which are tabulated in Table 3. 
The colon cancer dataset (Alon et al., 1999) consists of 62 tissue samples (22 normal and 
40 cancerous) with the gene expression profiles of n = 2000 genes extracted from DNA 
microarray data. Its first principal component explains 44.96% of the total variance. The 
leukemia dataset (Golub et al., 1999) consists of a training set of 38 samples (27 ALL and 
11 AML, two variants of leukemia) from bone marrow specimens and a test set of 34 sam- 
ples (20 ALL and 14 AML). This dataset has been used widely in classification settings 
where the goal is to distinguish between two variants of leukemia. All samples have 7129 
features, each of which corresponds to a normalized gene expression value extracted from 
the microarray image. The first principal component explains 87.64% of the total variance. 
The Ramaswamy dataset has 16063 genes and 127 samples, its first principal component 
explaining 76.5% of the total variance. 

The high dimensionality of these datasets makes them suitable candidates for studying 
the performance of sparse PCA algorithms, by investigating their ability to explain variance 
in the data based on a small number of genes, to obtain interpretable results. Since DSPCA 
and GSPCA are not scalable for these large datasets, in our study, we compare DC-PCA 
to SPCA and GPower v 

Trade-off curves. Figures 4(a-c) show the proportion of explained variance versus the 
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Table 4: Computation time (in seconds) to obtain the first sparse PC, averaged over car- 
dinalities ranging from 1 to n, for the Colon cancer, Leukemia and Ramaswamy 
datasets. 





Colon cancer 


Leukemia 


Ramaswamy 


n 


2000 


7129 


16063 


SPCA 


2.057 


3.548 


38.731 


GPowerg 


0.182 


0.223 


2.337 


DC-PCA 


0.034 


0.156 


0.547 



cardinality for the first sparse PC for the datasets shown in Table 3. It can be seen that 
DC-PCA performs similar to GPower£ and performs better than SPCA. 

Computational complexity. The average computation time required by the sparse PCA 
algorithms on each dataset is shown in Table 4. The indicated times are averages over n 
computations, one for each cardinality ranging from n down to 1. The results show that 
DC-PCA and GPower£ are significantly faster than SPCA, which, for a long time, was 
widely accepted as the algorithm that can handle large datasets. 

Overall, the results in this section demonstrate that DC-PCA performs similar to GPower£ , 
the state-of-the-art, and better than SPCA, both in terms of scalability and proportion of 
variance explained vs. cardinality. We would like to mention that our sparse PCA algo- 
rithm (DC-PCA) is derived from a more general framework, that can be used to address 
other generalized eigenvalue problems as well, e.g., sparse CCA, sparse FDA, etc. 

5. Sparse Canonical Correlation Analysis 

In this section, we consider sparse CCA as a special case of the sparse GEV algorithm and 
present two CCA applications where sparsity is helpful. We call our sparse CCA algorithm 
DC-CCA, where A and B are determined from the covariance and cross-covariance matrices 
as explained right below (3). Note that A is indefinite and, therefore, in our experiments, 
we choose r = — X m i n (A) in Algorithm 1. In the following, we present two sparse CCA 
applications, one related to the task of cross-language document retrieval and the other 
dealing with semantic annotation and retrieval of music (Torres et al., 2007a, b). 

The sparse CCA algorithm considered in this section was earlier proposed by us in 
Torres et al. (2007b). Related work involves the sparse CCA algorithm due to Hardoon and 
Shawe- Taylor (2008). In Section 3, we presented a SDP relaxation, which could be applied 
for sparse CCA. However, in this section, we use DC-CCA (based on Algorithm 1) to perform 
sparse CCA, as it scales better for large problem sizes. We illustrate its performance in the 
above mentioned applications. 
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Figure 4: Trade-off curves between explained variance and cardinality for (a) Colon can- 
cer, (b) Leukemia and (c) Ramaswamy datasets. The proportion of variance 
explained is computed on the first sparse principal component, (a-c) show that 
DC-PCA performs similar to GPower^ , while explaining more variance (for a 
fixed cardinality) than SPCA. 



5.1 Cross-language document retrieval 

The problem of cross-language document retrieval involves a collection of documents, {Di}f = ^ 
with each document being represented in different languages, say English and French. The 
goal of the task is, given a query string in one language, retrieve the most relevant docu- 
ment (s) in the target language. The first step is to obtain a semantic representation of the 
documents in both languages, which models the correlation between translated versions, so 
we can detect similarities in content between the two document spaces (one for English and 
the other for French). This is exactly what CCA does by finding a low-dimensional rep- 
resentation in both languages, with maximal correlation between them. Vinokourov et al. 
(2003) used CCA to address this problem and showed that the CCA approach performs 
better than the latent semantic indexing approach used by Littman et al. (1998). CCA 
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provides an efficient basis representation (that captures the maximal correlation) for the 
two document spaces. 

Using a bag-of-words representation for the documents, sparse CCA would allow to find 
a low-dimensional model based on a small subset of words in both languages. This would 
improve the interpretability of the model and could identify small subsets of words that 
are used in similar contexts in both languages and, possibly, are translations of one an- 
other. Representing documents by their similarity to all other documents (e.g., by taking 
inner products of bag-of-word vectors, as explained below), sparse CCA would create a 
low-dimensional model that only requires to measure the similarity for a small subset of the 
training documents. This would immediately improve storage requirements and the effi- 
ciency of retrieval computations. In this study, we follow the second approach, representing 
documents by their similarity to all other training documents by applying a linear kernel 
function to a binary bag-of-words representation of the documents, as proposed in Vinok- 
ourov et al. (2003). This will illustrate how we can achieve significant sparsity without 
significant loss of retrieval performance. 

More specifically, each version of a document (English or French) is modeled using a 
bag-of-words feature vector. Within a feature vector, we associate an element in {0, 1} 
with each word Wi in its language vocabulary. A value of 1 indicates that Wi is found in 
the document. We collect the feature vectors into the N x P matrix E, where we collect 
the English feature vectors, and the N x Q matrix F, where we collect the French feature 
vectors. N is the number of documents and P and Q are the vocabulary sizes of E and 
F respectively. Computing the similarity between English documents as the inner product 
between their binary bag-of-words vectors (i.e., the rows of E) results in computing an 
N x N data matrix EE T . Similarly, we compute an N x N data matrix FF T and obtain 
two feature spaces which are both A-dimensional. 

By applying sparse CCA, we effectively perform simultaneous feature selection across 
two vector spaces and characterize the content of and correlation between English and 
French documents in an efficient manner. We use the DC-CCA algorithm, using the covari- 
ance and cross-variance matrices associated with the document matrices EE T and FF T 
and obtain successive pairs of sparse canonical components which we stack into the columns 
of Ve and Vp. (Subsequent pairs of these sparse canonical components are obtained by 
deflating EE T and FF T with respect to previous canonical components. For a detailed 
review on deflation, we refer the reader to Shawe- Taylor and Christianini (2004).) 

Then, given a query document in an input language, say English, we convert the query 
into the appropriate feature vector, gp. We project qp onto the subspace spanned by the 
sparse canonical components in the English language space by computing V^qp 9 . Similarly, 
we project all the French documents onto the subspace spanned by the sparse canonical com- 
ponents, Vf associated with the French language. Finally, we perform document retrieval 
by selecting those French documents whose projections are closest to the projected query, 
where we measure distance in a nearest neighbor sense. 



9. Notice how this projection, onto the sparse canonical components, only requires to compute a few 
elements of qs, i.e., the ones corresponding to the non-zero loadings of the sparse canonical components; 
differently said, we only need to compute the similarity of the query document to a small subset of all 
training documents. 
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Table 5: Average area under the ROC curve (in %) using CCA and sparse CCA (DC- 
CCA) in a cross-language document retrieval task, d represents the number of 
canonical components and sparsity represents the percentage of zero loadings in 
the canonical components. 



d 


100 


200 


300 


400 


500 


CCA 
DC-CCA 
Sparsity 


99.92 
95.72 
87.15 


99.93 
97.57 
87.56 


99.96 
98.45 
87.95 


99.95 
98.75 
88.21 


99.93 
99.04 
88.44 



5.1.1 Experimental Details 

The data set used was the Aligned Hansards of the 36th Parliament of Canada (Ger- 
mann, 2001), which is a collection of 1.3 million pairs of text chunks (sentences or smaller 
fragments) aligned into English and French translations. The text chunks are split into 
documents based on * * * delimiters. After removing stop words and rare words (those that 
occur less than 3 times), we are left with an 1800 x 26328 English document-by-term matrix 
and a 1800 x 30167 French matrix. Computing EE T and FF T results in matrices of size 
1800 x 1800. 

To generate a query, we select English test documents from a test set not used for 
training. The appropriate retrieval result is the corresponding French language version 
of the query document. To perform retrieval, the query and the French test documents 
are projected onto the sparse canonical components and retrieval is performed as described 
before. Table 5 shows the performance of DC-CCA (sparse CCA) against CCA. We measure 
our results using the average area under the ROC curve (average AROC). The results in 
Table 5 are shown in percentages. To go into detail, for each test query we generate an 
ROC curve from the ranked retrieval results. Results are ranked according to their projected 
feature vector's Euclidean distance from the query. The area under this ROC curve is used 
to measure performance. For example, if the first returned document was the most relevant 
(i.e., the corresponding French language version of the query document) this would result 
in an ROC with area under the curve (AROC) of 1. If the most relevant document came 
in above the 7b th percentile of all documents, this would lead to an AROC of 0.75, and so 
on. So, we're basically measuring how highly the corresponding French language document 
ranks in the retrieval results. For a collection of queries we take the simple average of each 
query's AROC to obtain the average AROC. An average AROC of 1 is best, a value of 0.5 
is as good as chance. 

In Table 5, we compare retrieval using sparse CCA to regular CCA. For sparse CCA, 
we use a sparsity parameter that leads to loadings that are approximately 10% of the full 
dimensionality, i.e., the canonical components are approximately 90% sparse. We note that 
sparse CCA is able to achieve good retrieval rates, only slightly sacrificing performance com- 
pared to regular CCA. This is the key result of this section: we can achieve performance 
close to regular CCA, by using only about 12% of the number of loadings (i.e., documents) 
required by regular CCA. This shows that sparse CCA can narrow in on the most informa- 
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tive dimensions exhibited by data and can be used as an effective dimensionality reduction 
technique. 

5.2 Vocabulary selection for music information retrieval 

In this subsection we provide a short summary of the results in Torres et al. (2007a), which 
nicely illustrate how sparse CCA can be used to improve the performance of a statistical 
musical query application, by identifying problematic query words and eliminating them 
from the model. The application involves a computer audition system (Turnbull et al., 
2008) that can annotate songs with semantically meaningful words or tags (such as, e.g., 
rock or mellow), or retrieve songs from a database, based on a semantic query. This system 
is based on a joint probabilistic model between words and acoustic signals, learned from a 
training data set of songs and song tags. "Noisy" words, that are not or only weakly related 
to the musical content, will decrease the system's performance and waste computational 
resources. Sparse CCA is employed to prune away those noisy words and improve the 
system's performance. 

The details of this experiment are beyond the scope of this work and can be found in 
Torres et al. (2007a). In short, each song from the CAL-500 dataset 10 is represented in two 
different spaces: in a semantic space, based on a bag-of-words representation of a song's 
semantic tags, and in an audio space, based on Mel-frequency cepstral coefficients (Mckin- 
ney, 2003) extracted from a song's audio content. This representation allows sparse CCA to 
identify a small subset of words spanning a semantic subspace that is highly correlated with 
audio content. In Figure 5, we use sparse CCA to generate a sequence of vocabularies of 
progressively smaller size, ranging from full size (containing about 180 words) to very sparse 
(containing about 20 words), depicted on the horizontal axis. For each vocabulary size, the 
computer audition system is trained and the average area under the receiver operating char- 
acteristic curve (AROC) is shown on the vertical axis, measuring its retrieval performance 
on an independent test set. The AROC (ranging between 0.5 for randomly ranked retrieval 
results and 1.0 for a perfect ranking) initially clearly improves, as sparse CCA (DC-CCA) 
generates vocabularies of smaller size: it is effectively removing noisy words that are detri- 
mental for the system's performance. Also shown in Figure 5 are the results of training the 
music retrieval system based on two alternative vocabulary selection techniques: random 
selection (offering no improvement) and a heuristic that eliminates words exhibiting less 
agreement amongst the human subjects that were surveyed to collect the CAL-500 dataset 
(only offering a slight improvement, initially). 

In summary, Torres et al. (2007a) illustrates that vocabulary selection using sparse 
CCA significantly improves the retrieval performance of a computer audition system (by 
effectively removing noisy words) , outperforming a random baseline and a human agreement 
heuristic. 



10. The CAL-500 data set consists of a set of songs, annotated with semantic tags, obtained by conducting 
human surveys. More details can be found in Turnbull et al. (2008). 
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Figure 5: Comparison of vocabulary selection techniques for music retrieval. 



6. Sparse Fisher Discriminant Analysis 

In this section, we show that the FDA problem is an interesting special case of the GEV 
problem and that the special structure of A allows the sparse FDA problem to be solved 
more efficiently than the general sparse GEV problem. 

Let us consider the GEV problem in (GEV-P) with A G S+, B G S+ + and rank(A) = 1. 
This is exactly the FDA problem as shown in (4) where A is of the form A = aa T , with 
a = (fii — /12) £ M n . The corresponding GEV problem is written as 

X max (A, B) = max (a T x) 2 

X 

s.t. x T Bx = 1, (43) 

( T \2 

which can also be written as X max (A, B) = max^^o x T Bx ' ^ mce we are primarily inter- 
ested in the maximizer of (43), we can rewrite it as 

min^^ — -7T = min{x T Bx : a T x = 1|. (44) 
x^o (a T x) 2 

The advantage of the formulation in (44) will become clear when we consider its sparse 
version, i.e., after introducing the constraint {x : \\x\\q < k} in (44). Clearly, introducing 
the sparsity constraint makes the problem intractable. However, introducing an ^i-norm 
relaxation in this formulation gives rise to a convex program, 

min{a; T Sa; : a T x = 1, ||a?||i < k}, (45) 

more specifically a quadratic program (QP), and the corresponding penalized version is 
given by 

mm{x T Bx + ^||a;||i : a T x = l}, (46) 
where v > is the regularization parameter. 
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Note that a transformation similar to the one leading to (44) can be performed for the 
GEV problem with any, general A 6 i.e., writing the GEV problem as a minimization 
problem, 



min x T Bx 

X 

s.t. x T Ax = 1. (47) 

This formulation, however, is not useful to simplify solving a GEV problem in general. 
Indeed, consider the sparse version of the problem in (47) with the sparsity constraint 
{x : \\x\\q < k} relaxed to {x : \\x\\i < k}. Because of the quadratic equality constraint, the 
resulting program is non-convex for any A. Suppose that the constraint set {x : x T Ax = 1} 
is relaxed to {x : x T Ax < 1}. If A € S n \S™, the program is still non-convex as the 
constraint defines a non-convex set. If A € S", then the optimum occurs at x = 0. 
Therefore, the minimization formulation of the GEV problem in (47) is not useful, unlike 
the case where AgS™ and rank(A) = 1. 

Based on the discussion so far, it is clear that the sparse FDA problem can be solved 
as a convex QP, which is significantly more efficient than, e.g., an SDP relaxation as in (7) 
for sparse PCA or sparse CCA. Suppose that one would like to use a better approximation 
to ||cc||o than ||x||i, for sparse FDA. Using the approximation we considered in this work, 
(45) reduces to 



min x T Bx + v £ log(e + \xi\) 

X 

1=1 

s.t. a T x = 1, (48) 



where v £ := v j log(l + e 1 ). Applying the MM method to the above program results in the 
following iterative scheme, 



i=l 

T 



aj(' +1 ) = arg min x T T 

* \x)" J \ + e 



s.t. a 1 x = l, (49) 

which is a sequence of QPs unlike Algorithm 1, which is a sequence of QCQPs. The nice 
structure of A makes the corresponding sparse GEV problem computationally efficient. 
Therefore, one should solve the sparse FDA problem by using (45) or (49) instead of using 
the convex SDP in (7) or Algorithm 1. 

Suykens et al. (2002, Section 3.3) and Mika et al. (2001, Proposition 1) have shown 
connections between the FDA formulation in (44) with a = — and B = Si + S2 (see 
paragraph below (4) for details) and least-squares support vector machines (classifiers that 
minimize the squared loss, see Suykens et al. (2002, Chapter 3)). Therefore, sparse FDA is 
equivalent to feature selection with least-squares support vector machines. In other words, 
(45) is equivalent to LASSO, while the formulation in (48) is similar to the one considered 
in Weston et al. (2003). Since these are well studied problems, we do not pursue further 
showing the numerical performance of sparse FDA. 
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7. Conclusion and Discussion 

We study the problem of finding sparse eigenvectors for generalized eigenvalue problems. 
After proposing a non-convex but tight approximation to the cardinality constraint, we 
formulate the resulting optimization problem as a d.c. program and derive an iterative 
algorithm, based on the majorization-minimization method. This results in solving a se- 
quence of quadratically constrained quadratic programs, an algorithm which exhibits global 
convergence behavior, as we show. We also derive sparse PCA (DC-PCA) and sparse CCA 
(DC-CCA) algorithms as special cases of our proposed algorithm. Empirical results demon- 
strate the performance of the proposed algorithm for sparse PCA and sparse CCA applica- 
tions. In the case of sparse PCA, we experimentally demonstrate on both benchmark and 
real-life datasets of varying dimensionality that the proposed algorithm (DC-PCA) explains 
more variance with sparser features than SPCA (Zou et al., 2006) while performing simi- 
larly to DSPCA (d'Aspremont et al., 2007) and GSPCA (Moghaddam et al, 2007) at better 
computational speed (lower CPU time). On the other hand, DC-PCA has performance and 
scalability similar to that of the state-of-the-art GPower^,, algorithm. We also illustrate 
the practical relevance of the sparse CCA algorithm in two applications: cross-language 
document retrieval and vocabulary selection for music information retrieval. 

The proposed algorithm does not allow to set the regularization parameter a priori, 
to guarantee a given sparsity level. This is similar for SPCA and GPower£ . SDP-based 
relaxation methods, on the other hand (e.g., DSPCA in the context of sparse PCA) are 
better suited to achieve a given sparsity level in one shot, by incorporating an explicit 
constraint on the sparsity of the solution (although, eventually, through relaxation, an 
approximation of the original problem is solved). Since the algorithm we propose solves a 
LASSO problem in each step but with a quadratic constraint, one could explore using a 
modified version of path following techniques like least angle regression (Efron et al. , 2004) 
to learn the entire regularization path. 
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Appendix A. Derivation of the SDP relaxation in (7) 

The idea in deriving the SDP relaxation in (7) is to start with the approximate program 
in (5) and then derive its bi-dual (dual of the dual). Though (7) is not a canonical convex 
program, its Lagrangian dual is always convex. Therefore, obtaining the dual program of 
this dual provides a convex approximation to (7), which is what we derive below. 
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Consider the £i-norm relaxed sparse GEV problem in (5), which we reproduce here for 
convenience. 

max x T Ax 

s.t. x T Bx < 1, ||a;||i < k. (50) 

The above problem can be re-written as 

max x T Ax 
x,y 

s.t. x T Bx < 1, — y -< x -< y 

y T l < k. (51) 

The corresponding Lagrangian dual problem is given by 

min max L(x,y, 8, u,u,s), 

/3>0,^>0 x,y 

where 

L(x, y, 8, n, u, s) = x T Ax - fi(x T Bx - 1) - 8{y T l - k) - u T (x - y) + s T (x + y). (52) 

Let us first maximize L over x. By Lemma 3.6 of Lemarechal and Oustry (1999), the 
necessary and sufficient condition for Q(x) = x T (A — jjlB)x + x T (s — u) to have a finite 
upper bound over R n is fiB — A y and s — u G 1Z(/j,B — A). Differentiating L w.r.t. 
x yields x = \([iB — Ay(s — u). Similarly, while maximizing L w.r.t. y, the necessary 
and sufficient condition for R{y) = y T {s + u — 81) to have a finite upper bound over W 1 is 
s + u = 81. Therefore, the dual program can be written as 

min -(u - s) T (fiB - A) f (w - s) + 8k + fi 

s.t. fj,B-At.O, u-s£ TZ(iiB - A) 

s + u = 81, 8> 0, /x> 0, u>: 0, s y 0, (53) 

which is equivalent to 

min -r T (fiB — A)V + 8k + fi 

r,/3,[i 4 

s.t. fiB - A y 0, r G K(fiB - A) 

-81 < r ^ 81, 8 > 0, n > 0. (54) 

By invoking the Schur's complement lemma, the dual can be written as 

min t + 8k + [i 

s.t. -Pllripi,/3>0,n>0 



fiB-A -\r 



i_ r T 1 )h0. (55) 
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The bi-dual associated with (50) is obtained by computing the dual of (55) given by 

max min L(r,t, 3, u, d>,a,9,r, X ,x,n). (56) 
<pm,a>o,e>o r yo,tm 
Tto,Tito,xyo /3>o,At>o 

Here L is the Lagrangian associated with (55), given by 

L(r,t,P,n,<p,a,6,r,X,x,T}) = t + 3k + fi + rj T (r - 31) - r T (r + 31) - an - 90 

() , X x \f \iB — A -\r 



x T </> )\ -\r T t 
= tr(XA) + n(l - a - tr(XB)) + t(l - 0) 

+3{k-rj T l-T T l-6) + r T (r]-T + x). (57) 

Minimizing the above Lagrangian results in 

max tr(XA) 

a,9,T,r},x,X 

s.t. a + ti(XB) = 1, x + r) = r, (rj + t) t 1 + d = k 



which is equivalent to 



max tr(XA) 

x,X 

s.t. ti(XB) < 1, ||a;||i < k 
X x 
x T 1 



T 1 J h 0, (59) 



as shown in (7). 

Appendix B. Alternative derivation of (ALG) 

(ALG) can be derived differently by starting with (13) and applying the linear majorization 
idea (see Example 1). 

Consider the d.c. program in (13), which is of the form min Xy y(u(x, y) — v(x, y)) where 
u(x,y) = In(x,y) + t\\x\\% and v(x,y) = x T (A + rl n )x - p £ J2i=i lo s(yi + s) with ft = 
{{x,y) : x T Bx < 1, — y < x ^ y}. Here Iq represents the indicator function of the convex 
set f2 given by 

0, (i,y)en 
oo, otherwise 



In(x,y) 



It is easy to check that u and v are convex. Therefore, by (18) in Example 1, the MM 
algorithm gives 



(a-C+i^ = argmin t\\x\\ 2 2 - 2x T (A + rl n )x^ + p £ V 

x,y ^-^ 



Vi 



,(0 



i=i v\ + e 

s.t. cc r 5a; < 1, -y ^ x ■< y, (60) 



which is equivalent to (ALG). 
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Appendix C. Derivation of (ALG-S) 

Suppose A y and B = I n . Since A y 0, r can be chosen as zero. Using r = in (ALG), 
we have that is the maximizer of the following program: 



max x T Ax^ — — 

X T X<1 2 



x 



= max ~S^Xi(Ax®)i — 

1 x T x<l f-f 



— w)'\Xi\. 



(61) 



Consider the r.h.s. of (61). Since it is the maximization of a linear objective over a convex 
set, the unique optimum lies on the boundary of the convex set (Rockafellar, 1970, Theorem 
32.1). The Lagrangian associated with the program in the r.h.s. of (61) is given by 



L(x, A) = ^Tx l (Ax^) l - ^wf\ Xi \ - A J>f, 

i=l i=l 

where A > 0. Differentiating L w.r.t. x-i and setting it to zero yields 



(Ax®)i - f ™ W sign(^) 



2A 



Therefore, we have 



Xi = < 



(AcW)i + ^^° 



2A 
0, 



otherwise 



which is equivalently written as 



KAx^ 



2 w i 



sign((Ax®)i) 



Xj, — 



2A 



Since Yli=i x 1 = substituting for x% as given in (65) yields 



\(Ax(%\ - f sign((Aa;«).) 



(0 



and therefore in (ALG-S) follows. 



(62) 



(63) 



(64) 



(65) 
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